Does the smelliness of your city impact the kind of perfume you wear?
Author
Olivia Li and Riya Sinha
Introduction
We are exploring whether the smelliness of a (US) city impacts the kind of perfume its dwellers wear. This project integrates perfume sales data from eBay with air quality metrics to analyze purchasing patterns across US cities. We have broken this topic into two separate goals:
Determine whether there is a noticeable difference in sales between perfume longevity among cities with higher/lower air pollution
Link perfume notes to cities with higher or lower air pollution
Data Sources:
We will be looking at three datasets found on Kaggle:
A dataset of global perfume brands, notes, types, and longevity (Perfume Dataset)
“This dataset was synthetically generated and curated with the assistance of AI (ChatGPT, OpenAI GPT-5). It was inspired by public fragrance resources such as Fragrantica, Basenotes, brand catalogs, and community reviews but does not reproduce copyrighted or proprietary content. It is designed for educational and research purposes only, providing a structured reference for fragrance analysis, experimentation, and machine learning projects”
A dataset of perfume listings on eBay shown with items sold and where they are located (estimating the most popular perfumes in a city) (e-Commerce Dataset)
“This dataset has been ethically mined from eBay using the Apify API, ensuring compliance with ethical data collection practices.”
A dataset of AQI (Air Quality Index) by city to determine their “smelliness” (Global Pollution Dataset)
“These datas are collected by using web scraping and slightly preprocessed by using hand engineering. These datas are collected from elichens.com”
Issues with the Data Sources
As the data sources were all CSV files, there were no issues importing and reading them into RStudio or Power Query.
Transformations:
This analysis will require us to perform multiple joins to get the final table that includes the perfume names, brands, units sold, location/city sold to, and city AQI level. Additionally, we will need to utilize different techniques to clean and run diagnostics on the data.
We began by filtering the Global Pollution data set to narrow down the US Cities and their corresponding AQI Values. We then concatenated men and women’s eBay Perfume eCommerce data. In order to match this with the US Cities’ pollution data, we needed to split the itemLocation by its delimiter and parse out the irrelevant, duplicate, or redundant location names, of which there were many.
To join the unstructured eBay perfume titles (1 Million by Paco Rabanne 3.4 Fl oz / 100 ml PARFUM Spray Men’s New & Sealed) with the structured perfume attribute data (name, brand, type, longevity, category), we utilized AI to create an intermediary matching table with a 0.6 confidence threshold. The final dataset merged these three sources—sales, pollution, and perfume characteristics—through left joins, followed by removal of null values to create a clean dataset ready for analysis of how perfume preferences correlate with air quality levels across different US cities.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::ident() masks dbplyr::ident()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::sql() masks dbplyr::sql()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```r
library(reticulate)
library(duckdb)  # also loads DBI for dbConnect()
options(duckdb.enable_rstudio_connection_pane = TRUE)
py <- import_main()
# py_install("pandas", pip = TRUE)
# py_module_available("pandas")

# create / connect to in-memory database (not stored in a file)
drv <- duckdb()
con <- dbConnect(drv)
```
```python
py_global_pollution_diag = (py_global_pollution
    .groupby(['City', 'Country'])
    .agg(dupe_count=('City', 'size'))  # count rows of duplicates after group by
    .sort_values('dupe_count', ascending=False)
    .reset_index()
)
```
py_global_pollution_diag.shape
(23035, 3)
py_global_pollution.shape
(23463, 12)
```sql
SELECT Country, COUNT(DISTINCT City) AS num_cities
FROM global_pollution
GROUP BY Country
-- 427 cities identified with no corresponding country
HAVING Country = 'United States of America'
-- 2872 distinct cities in the US
```
We ran diagnostics by first grouping by City to count how many times each city appears in the global pollution dataset. We then filtered to see whether any City appears more than once, and finally counted the total number of cities.
```python
py_us_pollution = (py_global_pollution
    .query("Country == 'United States of America'"))
py_us_pollution.loc[:, 'City'] = py_us_pollution['City'].str.lower()
```
py_us_pollution.shape
(2872, 12)
py_us_pollution.info() #checking that no null values were included
<class 'pandas.core.frame.DataFrame'>
Index: 2872 entries, 5 to 23461
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 2872 non-null object
1 City 2872 non-null object
2 AQI Value 2872 non-null int64
3 AQI Category 2872 non-null object
4 CO AQI Value 2872 non-null int64
5 CO AQI Category 2872 non-null object
6 Ozone AQI Value 2872 non-null int64
7 Ozone AQI Category 2872 non-null object
8 NO2 AQI Value 2872 non-null int64
9 NO2 AQI Category 2872 non-null object
10 PM2.5 AQI Value 2872 non-null int64
11 PM2.5 AQI Category 2872 non-null object
dtypes: int64(5), object(7)
memory usage: 291.7+ KB
```sql
CREATE OR REPLACE TABLE sql_us_pollution AS
SELECT
  Country,
  LOWER(City) AS City,
  "AQI Value", "AQI Category",
  "CO AQI Value", "CO AQI Category",
  "Ozone AQI Value", "Ozone AQI Category",
  "NO2 AQI Value", "NO2 AQI Category",
  "PM2.5 AQI Value", "PM2.5 AQI Category"
FROM global_pollution
WHERE Country = 'United States of America'
```
Ensuring that there are no duplicate cities within the US pollution dataset: it has 2872 distinct cities, which matches the 2872-row count.
```sql
SELECT COUNT(DISTINCT City) AS city_num, COUNT(*) AS row_num
FROM sql_us_pollution
```

1 record

| city_num | row_num |
|---|---|
| 2872 | 2872 |
```r
r_us_pollution <- r_global_pollution |>
  filter(Country == "United States of America") |>
  mutate(City = tolower(City))
nrow(r_us_pollution)
```
[1] 2872
Filtered the Country column in pq_global_pollution to include only "United States of America," then renamed the query to pq_us_pollution.
Transformed all City values to lowercase for consistency, and checked that no null values were included.
No null values were found.
Merge with eBay datasets (men and women), and unnest itemLocation
Stacking men and women data sets on top of each other
```python
py_long_ebay = (py_ebay
    .assign(City=lambda py_ebay_df_: py_ebay_df_['itemLocation']
            .str.split(',')
            .apply(lambda x: [item.strip() for item in x]))  # removing leading and trailing white space
    .drop(columns=['itemLocation'])
    .explode('City')
    .filter(items=['title', 'sold', 'City']))

# making City and title entries all lowercase before joining
py_long_ebay[['City', 'title']] = py_long_ebay[['City', 'title']].apply(lambda x: x.str.lower())
```
py_long_ebay.shape
(5897, 3)
```sql
CREATE OR REPLACE TABLE sql_long_ebay AS
SELECT
  LOWER(title) AS title,
  sold,
  LOWER(            -- standardizing case on itemLocation to match Python
    TRIM(           -- trimming white space
      UNNEST(
        STRING_SPLIT(
          REPLACE(itemLocation, ', ', ','),  -- standardizing delimiters from ", " to ","
          ',')))) AS City
  -- renaming this column so that we can use USING later in the join,
  -- because I don't like having duplicate columns
FROM ebay_women
UNION ALL
SELECT
  LOWER(title) AS title,
  sold,
  LOWER(
    TRIM(
      UNNEST(
        STRING_SPLIT(
          REPLACE(itemLocation, ', ', ','),  -- standardizing delimiters from ", " to ","
          ',')))) AS City
FROM ebay_men
```
```sql
SELECT COUNT(*) AS num_rows
FROM sql_long_ebay
```

1 record

| num_rows |
|---|
| 5897 |
```r
r_long_ebay <- bind_rows(r_ebay_mens, r_ebay_womens) |>  # stack men and women ebay datasets on top of each other
  mutate(itemLocation = str_replace_all(itemLocation, ", ", ",")) |>  # standardize the delimiters to "," before separating
  separate_longer_delim(itemLocation, delim = ",") |>
  # unnesting itemLocation and making it longer so we can join on City and
  # itemLocation; this also trims white space
  select(title, sold, itemLocation) |>
  mutate(itemLocation = tolower(itemLocation),
         title = tolower(title))
nrow(r_long_ebay)
```
[1] 5897
Appended pq_ebay_mens to pq_ebay_womens.
Filtered columns to include only title, sold, and itemLocation.
Standardized delimiters to ",".
Unnested itemLocation to make the table longer.
Filtered out perfumes that were not sold.
Changed itemLocation and title to lowercase for consistency.
Join us_pollution and long_ebay on cities
Some of the itemLocation values from r_ebay included places outside the US, like Hong Kong, and the order/formatting was not always consistent (e.g., "New Jersey, Hong Kong"; "Orange, New Jersey, USA"; "USA, New Jersey, Hong Kong"). We did an inner join with us_pollution to cut out any states and countries outside of the US.
Working in R, we were initially confused because the SQL query returned a different number of rows than the R output. Did we have a duplicate somewhere? We then checked whether any titles were duplicated.
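A minimal pandas sketch of that duplicate-title check (the frame and its values here are invented for illustration, not our real data): compare the row count against the number of distinct titles, and a gap between the two means some titles are duplicated.

```python
import pandas as pd

# hypothetical joined result, standing in for the output of the
# inner join between the pollution and eBay tables before deduplication
joined = pd.DataFrame({
    'title': ['polo edt', 'polo edt', '1 million edt'],
    'City':  ['new york', 'new york', 'warren'],
})

n_rows = len(joined)                           # total rows
n_distinct_titles = joined['title'].nunique()  # distinct titles
print(n_rows, n_distinct_titles)               # a gap signals duplicated titles
```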
This confused us even more: how could the number of distinct titles be so much smaller than our row count? Our hypothesis was that some titles were duplicated when we lengthened the table and joined on City after delimiting itemLocation. For example, "New York, New York, USA" was split into three rows with locations "new york", "new york", and "usa". We had inner joined the pollution and eBay data on City = itemLocation, hoping that only the city names would match and be kept; however, this made no distinction between New York City and New York (state), so the titles were duplicated for those entries.
We then needed to get rid of the duplicates and keep only the unique titles. This refined the output to 1,451 perfumes with no duplicates, and a quick check across all languages confirmed the results agreed.
```python
py_perfume_sales_by_city = (py_us_pollution
    .merge(py_long_ebay, how='inner', on='City')
    .drop(columns=['Country'])  # dropping Country since these are all US cities
    # using py_perfume_sales_by_city.info() we found
    # there are some null values under "sold"
    .dropna()  # getting rid of null values
    .sort_values(['title', 'City', 'sold'])
    # sort to make the drop-duplicates command deterministic across languages
    .drop_duplicates(subset='title')  # getting rid of duplicate perfume titles
    .reset_index(drop=True)
)
```
py_perfume_sales_by_city.shape
(1451, 13)
py_perfume_sales_by_city.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1451 entries, 0 to 1450
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 City 1451 non-null object
1 AQI Value 1451 non-null int64
2 AQI Category 1451 non-null object
3 CO AQI Value 1451 non-null int64
4 CO AQI Category 1451 non-null object
5 Ozone AQI Value 1451 non-null int64
6 Ozone AQI Category 1451 non-null object
7 NO2 AQI Value 1451 non-null int64
8 NO2 AQI Category 1451 non-null object
9 PM2.5 AQI Value 1451 non-null int64
10 PM2.5 AQI Category 1451 non-null object
11 title 1451 non-null object
12 sold 1451 non-null float64
dtypes: float64(1), int64(5), object(7)
memory usage: 147.5+ KB
```sql
CREATE OR REPLACE TABLE sql_perfume_sales_by_city AS
SELECT DISTINCT ON (title) *
FROM sql_long_ebay
INNER JOIN sql_us_pollution USING (City)
WHERE sold IS NOT NULL
ORDER BY title, City, sold  -- sorting to stay deterministic across languages
```
```sql
SELECT COUNT(*) FROM sql_perfume_sales_by_city
```

1 record

| count_star() |
|---|
| 1451 |
```r
r_perfume_sales_by_city <- inner_join(r_us_pollution, r_long_ebay,
                                      by = join_by(City == itemLocation)) |>
  drop_na() |>
  arrange(title, City, sold) |>  # sort by title to keep results deterministic with Python
  distinct(title, .keep_all = TRUE)
```
nrow(r_perfume_sales_by_city)
[1] 1451
inner join on itemLocation and City
expanding the table
sorting table by title to keep it consistent with python
removing duplicate titles
verifying final number of rows
Join perfume_sales_by_city with perfumes data set
Now we need to join perfume_sales_by_city with the perfumes dataset (which covers perfume notes and intensities) on perfume title. Because of the extreme variety in the eBay title listings, we would not be able to join the two tables on their own. We utilized Claude AI to create an intermediary table so that we could match the two columns we would be joining on. The "perfume" column holds the controlled text, whereas "title" is uncontrolled free text where there may not be any matches. After reviewing random samples of the results, we settled on a final 0.6 confidence threshold.
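Applying that threshold amounts to a simple filter on the match-confidence column. A minimal sketch (the rows below are invented; only the column names ebay_title, perfume, and match_confidence follow our real intermediary table):

```python
import pandas as pd

# toy stand-in for the AI-generated matching table
intermediary = pd.DataFrame({
    'ebay_title': ['1 million by paco rabanne 3.4 fl oz parfum spray',
                   'mystery perfume lot'],
    'perfume': ['1 million parfum', 'unknown'],
    'match_confidence': [0.92, 0.35],
})

# keep only matches at or above the 0.6 confidence threshold
kept = intermediary[intermediary['match_confidence'] >= 0.6]
print(len(kept))  # 1
```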
```python
# fixing case to be consistent across tables
py_intermediary[['ebay_title', 'perfume']] = py_intermediary[['ebay_title', 'perfume']].apply(lambda x: x.str.lower())
py_perfumes['perfume'] = py_perfumes['perfume'].str.lower()
py_perfume_sales_by_city['title'] = py_perfume_sales_by_city['title'].str.lower()

# left join py_perfume_sales_by_city and py_intermediary on title
py_final = (py_intermediary
    .rename(columns={'ebay_title': 'title'})  # renaming to match column between tables
    .merge(py_perfume_sales_by_city, how='left', on='title')
    .merge(py_perfumes,  # merge perfumes
           how='left', on='perfume')
    .dropna()  # a lot of null values where things weren't sold/located
    .sort_values(['title', 'City', 'sold'])
    .drop_duplicates(subset=['title'])
    .reset_index()
    .filter(items=['perfume', 'City', 'AQI Value', 'AQI Category',
                   'sold', 'brand', 'category', 'longevity'])
    .replace({'longevity': {"6–8 hours": "Strong"}})
)
```
print(py_final)
perfume City ... category longevity
0 polo eau de toilette brooklyn ... Woody Aromatic Medium
1 polo eau de toilette brooklyn ... Woody Aromatic Medium
2 1 million eau de toilette warren ... Woody Spicy Medium
3 1 million eau de toilette hackensack ... Woody Spicy Medium
4 1 million lucky hackensack ... Woody Fruity Strong
.. ... ... ... ... ...
429 mon paris edp sacramento ... Floriental Medium
430 black opium edp fort wayne ... Ambery Gourmand Strong
431 y edp brooklyn ... Aromatic Fougere Strong
432 y edp sacramento ... Aromatic Fougere Strong
433 y edp sacramento ... Aromatic Fougere Strong
[434 rows x 7 columns]
```sql
-- using a CTE
CREATE OR REPLACE TABLE sql_final AS
-- first joining intermediary and perfume_sales_by_city
WITH sql_med AS (
  SELECT
    LOWER(intermediary.ebay_title) AS ebay_title,
    LOWER(intermediary.perfume) AS perfume,
    sql_perfume_sales_by_city.*
  FROM intermediary
  LEFT JOIN sql_perfume_sales_by_city
    ON LOWER(intermediary.ebay_title) = LOWER(sql_perfume_sales_by_city.title)
)
-- now querying from the new sql_med table for the second left join
SELECT DISTINCT ON (ebay_title)
  sql_med.perfume,
  City,
  "AQI Value",
  "AQI Category",
  sold,
  perfumes.brand,
  type,
  category,
  CASE WHEN longevity = '6–8 hours' THEN 'Strong' ELSE longevity END AS longevity
FROM sql_med
LEFT JOIN perfumes
  ON LOWER(sql_med.perfume) = LOWER(perfumes.perfume)
WHERE sold IS NOT NULL             -- getting rid of null values that would result
  AND sql_med.perfume IS NOT NULL  -- from some perfumes not being sold to certain places
  AND City IS NOT NULL
  AND perfumes.brand IS NOT NULL
  AND category IS NOT NULL
  AND longevity IS NOT NULL
ORDER BY ebay_title, City, sold
```
```sql
SELECT * FROM sql_final
```

Displaying records 1 - 10

| perfume | City | AQI Value | AQI Category | sold | brand | type | category | longevity |
|---|---|---|---|---|---|---|---|---|
| polo eau de toilette | brooklyn | 35 | Good | 14 | Ralph Lauren | EDT | Woody Aromatic | Medium |
| polo eau de toilette | brooklyn | 35 | Good | 5 | Ralph Lauren | EDT | Woody Aromatic | Medium |
| 1 million eau de toilette | warren | 50 | Good | 2 | Paco Rabanne | EDT | Woody Spicy | Medium |
| 1 million eau de toilette | hackensack | 117 | Unhealthy for Sensitive Groups | 8877 | Paco Rabanne | EDT | Woody Spicy | Medium |
| 1 million lucky | hackensack | 117 | Unhealthy for Sensitive Groups | 33 | Paco Rabanne | Parfum | Woody Fruity | Strong |
| 1 million eau de toilette | warren | 50 | Good | 19 | Paco Rabanne | EDT | Woody Spicy | Medium |
| 212 men nyc | lincoln park | 66 | Moderate | 207 | Carolina Herrera | EDT | Woody Spicy | Medium |
| 212 vip | hackensack | 117 | Unhealthy for Sensitive Groups | 3691 | Carolina Herrera | EDP | Woody Spicy | Medium |
| acqua di giò edt | thomasville | 61 | Moderate | 111 | Giorgio Armani | EDT | Aquatic Aromatic | Medium |
| acqua di giò edt | miami | 55 | Moderate | 235 | Giorgio Armani | EDT | Aquatic Aromatic | Medium |
Comparing SQL and Python Final table with anti-join to see if all entries are the same.
```r
sql_final <- tbl(con, "sql_final") |> collect()
anti_join(sql_final, py$py_final, by = c("perfume", "City")) |>
  select(perfume, City, sold)

anti_join(py$py_final, sql_final, by = c("perfume", "City")) |>
  select(perfume, City, sold)
```
```r
# fix case on dataframes
r_intermediary <- r_intermediary |>
  mutate(ebay_title = tolower(ebay_title),
         perfume = tolower(perfume))
r_perfumes <- r_perfumes |>
  mutate(perfume = tolower(perfume))

# left join the perfume_sales_by_city with the perfumes data set
r_final <- r_intermediary |>
  rename(title = ebay_title) |>  # renaming to match column between tables
  left_join(r_perfume_sales_by_city, by = "title",
            relationship = "many-to-many") |>
  # "many-to-many" to account for perfumes being sold in multiple cities
  # and having multiple attributes
  select(-brand, -match_confidence, -match_quality) |>  # drop columns
  left_join(r_perfumes, by = "perfume", relationship = "many-to-many") |>  # merge perfumes
  drop_na() |>  # drop null values
  arrange(title, City, sold) |>
  distinct(title, .keep_all = TRUE) |>  # drop duplicates based on title
  select(perfume, City, `AQI Value`, `AQI Category`, sold, category, longevity) |>
  mutate(longevity = recode(longevity, "6–8 hours" = "Strong"))
nrow(r_final)
```
[1] 434
```r
# checking that the datasets have the same rows across all languages
anti_join(r_final, py$py_final, by = join_by(City))
anti_join(py$py_final, r_final, by = join_by(City))
```
Fixed case on the intermediary table and on the perfumes table.
Used an inner join to merge pq_intermediary with pq_perfume_sales_by_city.
Expanded the table to include relevant information.
Filtered out rows with no value in the perfume column.
Used an inner join to merge the previous result with pq_perfumes.
Expanded to include relevant columns.
Removed duplicates.
Replaced "6–8 hours" with "Strong" in the longevity column.
Verified the final number of rows.
Analysis
Goal 1
Grouping by perfume longevity, and then aggregating the AQI level, perfumes sold, and number of cities, we are able to connect the strength of a perfume to the AQI level of a city. There appears to be a slight positive association between the two. The only outlier is "Light," for which we have data from only 15 cities.
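That aggregation can be sketched in pandas as follows (the rows below are invented toy values shaped like the final table, not our real data):

```python
import pandas as pd

# toy rows shaped like py_final
py_final = pd.DataFrame({
    'City': ['brooklyn', 'warren', 'miami', 'hackensack'],
    'AQI Value': [35, 50, 55, 117],
    'sold': [14, 2, 235, 8877],
    'longevity': ['Medium', 'Medium', 'Medium', 'Strong'],
})

# per longevity category: mean AQI, total units sold, distinct city count
summary = (py_final
    .groupby('longevity')
    .agg(avg_aqi=('AQI Value', 'mean'),
         total_sold=('sold', 'sum'),
         num_cities=('City', 'nunique')))
print(summary)
```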
Even though the data may not be statistically significant, we were tempted to conclude that cities with more pollution (i.e., "smellier" cities) lead residents to choose stronger perfumes.
However, after running an ANOVA test to test the statistical significance of these values, we found that there is no significant relationship between the longevity of perfume worn and the AQI values of the city.
H0: Mean AQI values are the same across all longevity categories of perfume.
H1: At least one longevity category has a different mean AQI value.
```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.clf()  # clearing previous boxplots
sns.boxplot(
    data=py_final,
    x='longevity',
    y='AQI Value',
    hue='longevity',  # filling in boxes by longevity category
    order=['Light', 'Medium', 'Strong', 'Very Strong'],
    flierprops=dict(markerfacecolor='red')  # making outliers red
)
plt.title('AQI Value by Longevity Category')
plt.xlabel('Longevity')
plt.ylabel('AQI Value')
plt.show()
```
```r
r_anova <- aov(`AQI Value` ~ longevity, data = r_final)
summary(r_anova)
```
Df Sum Sq Mean Sq F value Pr(>F)
longevity 3 1980 660.1 0.694 0.556
Residuals 430 409139 951.5
```r
ggplot(r_final, aes(x = longevity, y = `AQI Value`, fill = longevity)) +  # filling boxes by longevity category
  geom_boxplot(outlier.color = "red") +  # coloring outliers red
  theme(legend.position = "none") +  # removing the legend since it's redundant
  labs(title = "AQI Value by Longevity Category",
       x = "Longevity",
       y = "AQI Value")
```
SQL has no built-in ANOVA function, so this analysis has to be done in Python, R, or Power Query instead.
Started by removing all columns other than longevity and AQI Value, and added an index column to be able to pivot the data wide.
Pivoted the longevity column to make the table wide, with columns Light, Medium, Strong, and VeryStrong, then removed the index column. After this, we loaded the data into the Excel file and closed Power Query. We then used the Analysis ToolPak in Excel to run the ANOVA.
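The same index-then-pivot widening step can be sketched in pandas (toy values for illustration; in our project this was done in Power Query):

```python
import pandas as pd

# toy long-format data: one AQI value per sale, tagged with longevity
aqi_long = pd.DataFrame({
    'longevity': ['Light', 'Medium', 'Medium', 'Strong'],
    'AQI Value': [40, 35, 50, 117],
})

# number each row within its longevity group, then pivot so each category
# becomes its own column; ragged groups are padded with NaN
aqi_wide = (aqi_long
    .assign(idx=aqi_long.groupby('longevity').cumcount())
    .pivot(index='idx', columns='longevity', values='AQI Value'))
print(aqi_wide)
```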
Installed the Analysis ToolPak, selected ANOVA: Single Factor, and entered the input range, alpha value, and output cell range for the final ANOVA analysis.
Box Plot for the data
The p-value is 0.556. Since p > 0.05, we cannot reject H0.
Goal 2
We see that categories like “Oud”, “Leather”, and “Woody” are more popular in cities with higher average AQI levels. Comparatively, categories like “Fresh” and “Musk” are popular in cities with lower average AQI levels. It seems to us that people favor more “intense” scent notes in cities with higher pollution.
Grouped by category and added a column for average AQI.
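In pandas terms, this grouping looks roughly like the following (toy rows, invented values):

```python
import pandas as pd

# toy rows shaped like the final merged table
py_final = pd.DataFrame({
    'category': ['Woody Spicy', 'Woody Spicy', 'Aquatic Aromatic'],
    'AQI Value': [50, 117, 55],
})

# average AQI per scent category, highest first
avg_aqi_by_category = (py_final
    .groupby('category')['AQI Value']
    .mean()
    .sort_values(ascending=False))
print(avg_aqi_by_category)
```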
Upon first glance, there does not seem to be any statistical significance. We will not move forward with performing an ANOVA test for this analysis.
Conclusion
There was no statistical significance in either of our analyses to draw conclusions regarding:
whether longer-lasting perfumes are preferred in cities with higher AQIs; and
whether perfumes with strong notes are preferred in higher-AQI cities.
Limitations
The statistical significance of this data is low given that our dataset was relatively small (only 434 final rows); in particular, only 84 units of "Very Strong" perfume were recorded as sold, across only 2 cities. We must also acknowledge that the eCommerce data's "itemLocation" did not specify whether these were warehouse locations where perfumes were stored or locations where the customers were; we chose to interpret it as the latter. There were also locations that may have been listed as states (it was not clear) but were counted as cities when we joined the tables. For example, while there is a state named "Delaware," there is also a city with the same name in Ohio, and the global pollution data provided only the country, not corresponding states. Additionally, the global dataset had no date associated with it but was uploaded 3 years ago, so we assume it is relatively up to date and can still be used in conjunction with the eCommerce data from 2024.
There could have also been additional data that we did not account for and may have been dropped in the AI intermediary table creation. The datasets were also not completely comprehensive: there were cities that were not listed, and popular perfume brands that were not named.
Challenge Resolution
Our primary challenge in this project was data cleaning and integration across multiple datasets. Specifically, there was no straightforward way to exactly match perfume titles across the different title columns using the techniques covered in class. The same perfume often appeared under slightly different names in the e-commerce dataset, which made direct matching difficult and pushed our work beyond standard data wrangling. To address this, we used Claude and ChatGPT to assist in matching the titles to perfumes. The link to the prompts and iterative process of creating the intermediary table with Claude can be found here.
Another major challenge involved differences in case sensitivity across analytic frameworks. During the analysis phase, we discovered discrepancies in the average AQI values across frameworks when grouped by longevity and category, despite having exactly the same row counts in the final merged dataset. This revealed that inconsistent text casing was affecting how values were transformed. To resolve this, we converted all relevant text columns to lowercase to ensure consistent matching and equivalent treatment across all frameworks.
Takeaways
In Python, we were not as familiar with performing statistical tests and prompted Claude and Google AI for help. We discovered that we would need to install the “scipy” package into the Python Pandas environment (this is also how we learned about using the “matplotlib” and “seaborn” packages to create a customized boxplot). While not as intuitive as in RStudio, we were able to conduct the ANOVA test by separating the AQI Value data into the longevity categories, and running the ANOVA command (f_oneway) across all of them.
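A minimal sketch of that approach with scipy's `f_oneway` (the AQI values below are invented; in the real analysis each list was sliced out of py_final by longevity category):

```python
from scipy.stats import f_oneway

# toy AQI values, one list per longevity category
light  = [40, 38, 45]
medium = [35, 50, 55]
strong = [60, 117, 90]

# one-way ANOVA across the category groups
f_stat, p_value = f_oneway(light, medium, strong)
print(f_stat, p_value)  # p > 0.05 would mean we cannot reject H0
```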
For SQL, we were unable to perform ANOVA directly, as the framework does not provide built-in support for this type of statistical analysis.
In RStudio, ANOVA functionality is built in, allowing us to carry out the statistical analysis without relying on any additional toolkits. We also used the ggplot2 package and practices covered previously in the course through DataCamp to create a simple boxplot, as well as the STHDA guide to boxplots for further customization.
For Power Query, one team member revisited concepts from the Quantifying UX course taught by Dr. Jacek Gwizdka. This course covered hypothesis testing for categorical data using ANOVA and the Analysis ToolPak in Microsoft Excel. We revisited previous assignments completed for the course to familiarize ourselves with the process to conduct the test. We used pivot columns in Power Query to widen the data by longevity and then applied the ANOVA test across the four longevity categories using Data Analysis feature from Analysis ToolPak. Finally, we used Excel’s built-in chart features to visualize the results.